library(pickupanalysis)The goal of this project is to learn ggplot2 visualisation for a spatial dataset and to deliver basic analytics for the same using R language. The motivation for working with this project is to get a holistic understanding of the usage of R language in analytics domain and also to develop a familiarity with ggplot2 visualisations. Analysing Uber’s pickup data for Bengaluru City using plots and maps provide a better understanding of the large dataset, compared to viewing data in the dataframe.
Two datasets are available with this package:-
“boundaries” : This dataset that contains the names and geometry of areas in Bangalore City dataset. This dataset is downloaded from the uber movement dataset that is available online. The dataset is slightly modified to remove redundant columns. Source: movement.uber.com
“date_hour” : This dataset provides the number of trips for the first quarter of 2020 for the city Bengaluru, India. This dataset is slightly modified for the purpose of effective storage into this package.
library(jsonlite) #Simple and fast json parser and generator for R
library(sf) #Support for simple features, a standardized way to encode spatial vector data
#> Linking to GEOS 3.9.1, GDAL 3.2.1, PROJ 7.2.1; sf_use_s2() is TRUE
library(tidyverse) #collection of open source packages for the R programming language
#> -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
#> v ggplot2 3.3.6 v purrr 0.3.4
#> v tibble 3.1.7 v dplyr 1.0.9
#> v tidyr 1.2.0 v stringr 1.4.0
#> v readr 2.1.2 v forcats 0.5.1
#> -- Conflicts ------------------------------------------ tidyverse_conflicts() --
#> x dplyr::filter() masks stats::filter()
#> x purrr::flatten() masks jsonlite::flatten()
#> x dplyr::lag() masks stats::lag()
library(dplyr) #prominant data manipulation library
library(tidyr) #tidies messy data
library(ggplot2) #provides support for plots and visualisation
library(ggthemes) #provides themes for ggplots
library(lubridate) #helps wrangling with date, time format
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
library(DT) #helps project dataframes as html friendly datatables
library(scales) #automatically determines aesthetics for plots and maps
#>
#> Attaching package: 'scales'
#> The following object is masked from 'package:purrr':
#>
#> discard
#> The following object is masked from 'package:readr':
#>
#> col_factor
library(mapview) #provides create interactive visualisations of spatial data
library(ggpubr) #To create publication ready plots
library(viridis) #provides color scales for ggplots
#> Loading required package: viridisLite
#>
#> Attaching package: 'viridis'
#> The following object is masked from 'package:scales':
#>
#> viridis_pal
data(boundaries)
sf::sf_use_s2(FALSE) #Handles spherical geometry error
#> Spherical geometry (s2) switched off
boundaries$centroid <- st_centroid(boundaries$geometry)
#> Warning in st_centroid.sfc(boundaries$geometry): st_centroid does not give
#> correct centroids for longitude/latitude data
dim(boundaries)
#> [1] 198 4
mapview(boundaries)
lat_long_centroid <- boundaries$centroid
class(lat_long_centroid)
#> [1] "sfc_POINT" "sfc"
lat_long_mat <- st_coordinates(lat_long_centroid)
lat_long_tab <- as.data.frame(lat_long_mat)
boundaries$long <- lat_long_tab[,1]
boundaries$lat <- lat_long_tab[,2]
head(boundaries)
#> Simple feature collection with 6 features and 4 fields
#> Active geometry column: geometry
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: 77.53672 ymin: 12.9849 xmax: 77.76003 ymax: 13.14367
#> Geodetic CRS: WGS 84
#> # A tibble: 6 x 6
#> MOVEMENT_ID WARD_NAME geometry centroid
#> <chr> <chr> <MULTIPOLYGON [°]> <POINT [°]>
#> 1 1 Chowdeswari W~ (((77.59229 13.0972, 77.~ (77.58042 13.12171)
#> 2 2 Atturu (((77.56862 13.12705, 77~ (77.56004 13.1028)
#> 3 3 Yelahanka Sat~ (((77.59094 13.09842, 77~ (77.58393 13.09099)
#> 4 4 Vijnanapura (((77.67683 13.01147, 77~ (77.66957 13.00606)
#> 5 5 Basavanapura (((77.72899 13.02061, 77~ (77.71546 13.01685)
#> 6 6 Hudi (((77.73583 13.01279, 77~ (77.70549 13.02238)
#> # ... with 2 more variables: long <dbl>, lat <dbl>data(date_hour)
head(date_hour)
#> sourceid dstid month day start_hour end_hour
#> 1 102 97 3 13 10 16
#> 2 98 55 2 11 0 7
#> 3 148 111 1 11 16 19
#> 4 58 22 2 8 16 19
#> 5 54 62 1 18 16 19
#> 6 143 161 2 1 16 19date_hour$hour_add <- rowSums(date_hour[, c("start_hour", "end_hour")])
s <- unique(sort(date_hour$hour_add))
date_hour$Hour_slots <- as.factor(ifelse(date_hour$hour_add == s[1], 'Midnight to Early Morning',
ifelse(date_hour$hour_add == s[2], 'Morning Peak',
ifelse(date_hour$hour_add == s[3], 'After sunset',
ifelse(date_hour$hour_add == s[4], 'Mid-Day', 'Evening Peak')))))
date_hour$Hour_slots <- factor(date_hour$Hour_slots, levels=c("Midnight to Early Morning", "Morning Peak", "Mid-Day", "Evening Peak", "After sunset"))m <- unique(sort(date_hour$month))
date_hour$month_names <- as.ordered(ifelse(date_hour$month == m[1], 'January',
ifelse(date_hour$month == m[2], 'Febraury', 'March')))
date_hour$month_names <- factor(date_hour$month_names, levels=c("January", "Febraury", "March"))
date_hour$year <- as.numeric(2022)
date_hour$date_format <-as.Date(with(date_hour,paste(year,month,day,sep="-")),"%Y-%m-%d")
date_hour$day_of_week <- factor(wday(date_hour$date_format, label=TRUE))
date_hour <- date_hour[complete.cases(date_hour), ]
dim(date_hour)
#> [1] 7545777 12
head(date_hour)
#> sourceid dstid month day start_hour end_hour hour_add
#> 1 102 97 3 13 10 16 26
#> 2 98 55 2 11 0 7 7
#> 3 148 111 1 11 16 19 35
#> 4 58 22 2 8 16 19 35
#> 5 54 62 1 18 16 19 35
#> 6 143 161 2 1 16 19 35
#> Hour_slots month_names year date_format day_of_week
#> 1 Mid-Day March 2022 2022-03-13 Sun
#> 2 Midnight to Early Morning Febraury 2022 2022-02-11 Fri
#> 3 Evening Peak January 2022 2022-01-11 Tue
#> 4 Evening Peak Febraury 2022 2022-02-08 Tue
#> 5 Evening Peak January 2022 2022-01-18 Tue
#> 6 Evening Peak Febraury 2022 2022-02-01 Tue
jan_ban <- date_hour[date_hour$month == 1, ]
feb_ban <- date_hour[date_hour$month == 2, ]
mar_ban <- date_hour[date_hour$month ==3, ]
str(jan_ban)
#> 'data.frame': 2935699 obs. of 12 variables:
#> $ sourceid : int 148 54 59 95 140 99 52 94 87 152 ...
#> $ dstid : int 111 62 12 73 191 33 82 83 25 117 ...
#> $ month : int 1 1 1 1 1 1 1 1 1 1 ...
#> $ day : int 11 18 18 31 11 31 18 31 26 29 ...
#> $ start_hour : int 16 16 16 19 16 19 16 19 10 10 ...
#> $ end_hour : int 19 19 19 0 19 0 19 0 16 16 ...
#> $ hour_add : num 35 35 35 19 35 19 35 19 26 26 ...
#> $ Hour_slots : Factor w/ 5 levels "Midnight to Early Morning",..: 4 4 4 5 4 5 4 5 3 3 ...
#> $ month_names: Ord.factor w/ 3 levels "January"<"Febraury"<..: 1 1 1 1 1 1 1 1 1 1 ...
#> $ year : num 2022 2022 2022 2022 2022 ...
#> $ date_format: Date, format: "2022-01-11" "2022-01-18" ...
#> $ day_of_week: Ord.factor w/ 7 levels "Sun"<"Mon"<"Tue"<..: 3 3 3 2 3 2 3 2 4 7 ...daily_data_jan <- jan_ban %>% group_by(day) %>% dplyr::summarize(Total = n())
jan <- ggplot(daily_data_jan, aes(day, Total)) +
geom_bar(stat="identity",
fill="steelblue",
color="red") +
ggtitle("Trips Every Day in January", subtitle = "January days") +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)) +
scale_y_continuous(labels=comma)
daily_data_feb <- feb_ban %>% group_by(day) %>% dplyr::summarize(Total = n())
feb <- ggplot(daily_data_feb, aes(day, Total)) +
geom_bar(stat="identity",
fill="steelblue",
color="red") +
ggtitle("Trips Every Day in Febraury", subtitle = "Febraury Days") +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)) +
scale_y_continuous(labels=comma)
daily_data_mar <- mar_ban %>% group_by(day) %>% dplyr::summarize(Total = n())
mar <- ggplot(daily_data_mar, aes(day, Total)) +
geom_bar(stat="identity",
fill="steelblue",
color="red") +
ggtitle("Trips Every Day in March", subtitle = "March Days") +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)) +
scale_y_continuous(labels=comma)
ggarrange(jan, feb, mar + rremove("x.text"),
ncol = 2, nrow = 2)Plot1: Total number of trips for each months in the first quarter
doubleplot <- date_hour %>% group_by(month_names, Hour_slots) %>% dplyr::summarize(Total = n())
#> `summarise()` has grouped output by 'month_names'. You can override using the
#> `.groups` argument.
ggplot(doubleplot, aes(Hour_slots, Total, fill=month_names)) +
geom_bar(stat = "identity") +
ggtitle("Trips by Hour Slots and Months") +
scale_y_continuous(labels = comma)Plot 2: Total number of trips by hourslots and months
day_hour <- date_hour %>% group_by(day, Hour_slots) %>% dplyr::summarize(Total = n())
#> `summarise()` has grouped output by 'day'. You can override using the `.groups`
#> argument.
datatable(day_hour)Data table that shows correlation between day of the month and hour slots
ggplot(day_hour, aes(day, Hour_slots, fill = Total)) +
geom_tile(color = "white") +
ggtitle("Heat Map by Hour and Day")Plot 3: Heat Map by hourslots and day of month
month_levels <- date_hour %>% group_by(month_names) %>% dplyr::summarize(Total = n())
ggplot(month_levels, aes(month_names, Total, fill = month_names)) +
geom_bar(stat = "Identity") +
ggtitle("Trips in a month") +
theme(legend.position = "none") +
scale_y_continuous(labels = comma)Plot 4: Total number of trips by month
day_month_data <- date_hour %>% group_by(day_of_week, month_names) %>% dplyr::summarize(Trips = n())
#> `summarise()` has grouped output by 'day_of_week'. You can override using the
#> `.groups` argument.
ggplot(day_month_data, aes(day_of_week, month_names, fill = Trips)) +
geom_tile(color = "white") +
ggtitle("Heat Map by Month and Day")Plot 5: Total number of trips by day of week and month
ggplot(day_month_data, aes(day_of_week, Trips, fill = month_names)) +
geom_bar(stat = "identity", aes(fill = month_names), position = "dodge") +
ggtitle("Trips by Day and Month") +
scale_y_continuous(labels = comma) +
coord_cartesian(ylim=c(150000,500000))Plot 6: Trips by day of week and month
ggplot(day_month_data, aes(day_of_week, Trips, fill = day_of_week)) +
geom_boxplot() +
scale_y_log10() +
ggtitle("Distribution of Trips by Weekday") +
theme(legend.position = "top") +
coord_flip()Plot 7: Distribution of trips by day of the week
head(date_hour)
#> sourceid dstid month day start_hour end_hour hour_add
#> 1 102 97 3 13 10 16 26
#> 2 98 55 2 11 0 7 7
#> 3 148 111 1 11 16 19 35
#> 4 58 22 2 8 16 19 35
#> 5 54 62 1 18 16 19 35
#> 6 143 161 2 1 16 19 35
#> Hour_slots month_names year date_format day_of_week
#> 1 Mid-Day March 2022 2022-03-13 Sun
#> 2 Midnight to Early Morning Febraury 2022 2022-02-11 Fri
#> 3 Evening Peak January 2022 2022-01-11 Tue
#> 4 Evening Peak Febraury 2022 2022-02-08 Tue
#> 5 Evening Peak January 2022 2022-01-18 Tue
#> 6 Evening Peak Febraury 2022 2022-02-01 Tue
boundaries$MOVEMENT_ID <- as.integer(boundaries$MOVEMENT_ID)
date_hour <- left_join(date_hour, boundaries %>% dplyr::select(WARD_NAME, MOVEMENT_ID), by = c('sourceid' = 'MOVEMENT_ID'))
date_hour <- date_hour %>% rename(pickup_area = WARD_NAME)
date_hour <- left_join(date_hour, boundaries %>% dplyr::select(WARD_NAME, MOVEMENT_ID), by = c('dstid' = 'MOVEMENT_ID'))
date_hour <- date_hour %>% rename(drop_area = WARD_NAME)
wardname_pickups <- date_hour %>%
group_by(pickup_area) %>%
tally() %>%
arrange(desc(n))
datatable(wardname_pickups)
by_month_ward_name <- date_hour %>%
group_by(month, pickup_area, month_names, day) %>%
tally() %>%
arrange(pickup_area, month)
datatable(by_month_ward_name)
ward_by_wkday <- date_hour%>%
group_by(day_of_week, pickup_area) %>%
tally() %>%
arrange(pickup_area, day_of_week)
datatable(ward_by_wkday)
wkday_ward_hslots <- date_hour %>%
group_by(day_of_week, pickup_area, Hour_slots) %>%
tally() %>%
arrange(pickup_area, Hour_slots , day_of_week)
datatable(wkday_ward_hslots)
plot_usage_growth <- by_month_ward_name %>%
filter(pickup_area %in% c("Subhash Nagar","Kadugodi","Begur","Kempegowda Ward", "Kengeri", "Koramangala")) %>%
mutate(Date = paste(day, month, "2022", sep = "-")) %>%
mutate(Date = dmy(Date)) %>%
ggplot(aes(Date, n, colour = pickup_area)) +
geom_line() +
theme_bw() +
ggtitle("Uber Pick-Ups Growth")
plot_usage_growth Plot 8: Uber Pickup Trend for five wards in Bengaluru
plot_all_pickup <- wkday_ward_hslots %>%
filter(pickup_area %in% c("Subhash Nagar","Kadugodi","Begur","Kempegowda Ward", "Kengeri", "Koramangala")) %>%
ggplot(aes(day_of_week, Hour_slots)) +
geom_point(aes(size = n, colour = n)) +
theme_bw() +
theme(legend.position = "none") +
ggtitle("Pick-ups by Day, Time and Pickup Area") +
ylab("Time") +
facet_grid(pickup_area ~ .) +
scale_colour_gradient(low = "lightsteelblue", high= "midnightblue")
plot_all_pickupPlot 9: Uber pickups by week, slots and pickup area from five wards in Bengaluru
plot_Koramangala <- wkday_ward_hslots %>%
filter(pickup_area %in% c("Koramangala")) %>%
ggplot(aes(day_of_week, Hour_slots)) +
geom_point(aes(size = n, colour = n)) +
theme_bw() +
theme(legend.position = "none") +
ggtitle("Pick-ups by Day, Time in Koramangala") +
ylab("Time") +
scale_colour_gradient(low = "lightsteelblue", high= "midnightblue")
plot_KoramangalaPlot 10: Pickups by day of week, hourslots from Koramangala
heatmap_Koramangala <- wkday_ward_hslots %>%
filter(pickup_area %in% c("Koramangala")) %>%
ggplot(aes(day_of_week, Hour_slots, fill = n/3)) +
geom_tile(color="white", size=0.1) +
scale_fill_viridis(name="# Events/hour") +
coord_equal() +
labs(x=NULL, y= "Time", title="Pick-Ups per weekday & time of day in Koramangala")
heatmap_KoramangalaPlot 11: Heatmap by day of week and hour slots from Koramangala region
heatmap_other_pickuparea <- wkday_ward_hslots %>%
filter(pickup_area %in% c("Kadugodi","Begur", "Koramangala")) %>%
ggplot(aes(day_of_week, Hour_slots, fill = n/3)) +
geom_tile(color="white", size=0.1) +
scale_fill_viridis(name="# Events/hour") +
coord_equal() +
labs(x=NULL, y= "Time", title="Pick-Ups per weekday & time of day other pick-up area") +
facet_grid(. ~ pickup_area)
heatmap_other_pickupareaPlot 12: Similar heatmap with other areas namely, Kadugodi, Begur along with Koramangala
a <- ggplot(boundaries, aes(x = long, y = lat)) +
geom_point() +
coord_equal() +
xlab('Longitude') +
ylab('Latitude')
b <- ggplot(boundaries, aes(x = long, y = lat)) +
coord_equal() +
xlab('Longitude') +
ylab('Latitude') +
stat_density2d(aes(fill = ..level..), alpha = .5,
geom = "polygon", data = boundaries) +
scale_fill_viridis_c() +
theme(legend.position = 'none')
ggarrange(a, b + rremove("x.text"),
ncol = 2, nrow = 1)Plot 13: Centroid of the city
boundaries_geom <- boundaries %>% dplyr::select(MOVEMENT_ID, geometry)
bangalore_union <- st_union(boundaries_geom)
#> although coordinates are longitude/latitude, st_union assumes that they are planar
boundaries_geom_p <- boundaries %>% dplyr::select(MOVEMENT_ID, long, lat, WARD_NAME)
wardname_pickups_map <- date_hour %>%
group_by(pickup_area, sourceid) %>%
tally() %>%
arrange(desc(n))
datatable(wardname_pickups_map)Plot 14: Total number of pickups with the wardname for the first three months
wardname_pickups_map <- as.data.frame(wardname_pickups_map)
wardname_pickups_map <- wardname_pickups_map %>% rename(Total_Pickups = n)
wardname_pickups_map <- merge(x=wardname_pickups_map, y=boundaries_geom, by.x='sourceid', by.y='MOVEMENT_ID')
wardname_pickups_map <- st_as_sf(wardname_pickups_map)
mapviewOptions(fgb = FALSE)
wardname_pickups_map %>% mapview(zcol = "Total_Pickups", legend = TRUE, col.regions = sf.colors)Plot 14: Total number of pickups with the wardname for the first three months
ggplot() +
geom_sf(data = bangalore_union, fill="grey", alpha=0.3) +
geom_point(data=boundaries_geom_p, aes(x=long, y=lat, size=wardname_pickups_map$Total_Pickups, color=wardname_pickups_map$Total_Pickups)) +
scale_size_continuous(range=c(1,12)) +
scale_color_viridis(trans="log") +
theme_void()+ coord_sf(xlim = c(77.45, 77.80), ylim = c(12.80, 13.15), expand = FALSE)Plot 15: Bubble Plots for the number of pickups in the first quarter of 2020
Plot 1: Although definitive pattern is not observed with this plot, total number of trips by day in March shows a significant decline. This can be attributed to the first instance of covid19 situation in Bengaluru. It can also be observed that, there have been uber trips even after the declaration of Nationwide lockdown, although drastically reduced from normal.
Plot 2: Bar plot provides a clear idea that the demand for uber cars and autos is higher in the mid-day hours compared to the morning peak(7AM - 10AM) and evening peak hours(4PM - 7PM). To my surprise, there is also significant amount of demand in the midnight to early morning hour slots. Given the number of hours for each slots, highest demand for uber is spotted at the late after sunset, compared to evening and morning peak hours.
Plot 3: In general, there is a high demand in the mid day hours all through the three months
Plot 4: Total Trips by month plot shows clear decline in the number of trips over three months, this can be partly attributed to the covid situation.
Plot 5: On examining the heat map, although by general trend, highest demand is seen in the weekend. There is also equally similar demands in Mondays for the three months
Plot 6: On viewing this plot, different picture emerges. Month of February and March shows high demand during Wednesday compared to the weekends and Mondays.
Plot 7: In this plot, Mondays show the highest of bandwidths of demand compared to the even the weekends
Plot 8: Trendlines provides demand by ward names over the three months. Subhash Nagar shows highest demand, Kadugodi shows the lowest demand. This code snippet can be used to find the trendlines for any location in Bengaluru. Drastic decline is spotted towards the end of March.
Plot 9: This plot, provides demand by hour slots and day of week for six wards in Bengaluru. Subhash Nagar shows demand in all segments, whereas Kadugodi shows low demand in all segments and also shows general trend as exhibited in other plots.
Plot 10: This plot exhibits demand in Koramangala area in Bengaluru. This plot provides clear picture of uber pickup demand. Generally increasing demand is observed, peaking after sunset in all days. Lowest demand is found over the Early mornings of Fridays, this observation is significant as the lowest demand is not over the mid days of the week, but on Fridays.
Plot 11: This plot shows correspondence to the previous plot
Plot 12: Provides visible changes in the demand on comparison of Koramangala’s demand with others.
Plot 13: Provides a brief idea of concentration of wards in the Bengaluru city
Plot 14: Provides map of demands of uber for three months of 2020 in Bengaluru. This map provides a nice idea of the demands by wards in Bengaluru.
Plot 15: This bubble plot provides the concentration of demand in and around the Bengaluru city.
These plots provides insights on demands in Bengalure City. To my surprise, different type of plots from the same dataset provides different insights. Although the scope of analysis is less as the dataset is only for the first quarter of 2020, this project can be a good starting point for larger analysis. Each and every plots can be scaled up for better insights.